{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# K-Nearest-Neighbor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "In this document we will be exploring the creation of a K-Nearest-Neighbor alogrithm for digit classifcation on the MNIST dataset. We will take the 60,000 images run 10-fold cross validation on them to find the optimal value of k nearest neighbors, then we will run the algorithim on the test data provided and determine the accuracy, confidence interval for 95% and create a confusion matrix for the results. Then we will repeat the experiment with a sliding window technique in the KNN and run a significance test on the algorithim to determine which one preformed better."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Import Required Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import division\n",
    "import struct\n",
    "import gzip\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from math import sqrt\n",
    "import seaborn as sn\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy.spatial.distance import cdist\n",
    "from scipy.spatial.distance import euclidean\n",
    "from tqdm import tqdm_notebook as tqdm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### A function to read the MNIST data set as a numpy array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_idx(filename):\n",
    "    with gzip.open(filename) as f:\n",
    "        zero, data_type, dims = struct.unpack('>HBB', f.read(4))\n",
    "        shape = tuple(struct.unpack('>I', f.read(4))[0] for d in range(dims))\n",
    "        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Declare our dataset arrays"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a numpy array for the training data from the mnist dataset \n",
    "raw_train = read_idx('data/train-images-idx3-ubyte.gz')\n",
    "## Flatten the training array\n",
    "train_data = np.reshape(raw_train, (60000, 28 * 28))\n",
    "train_label = read_idx('data/train-labels-idx1-ubyte.gz')\n",
    "\n",
    "# Create a numpy array for the test data from the mnist dataset\n",
    "raw_test = read_idx('data/t10k-images-idx3-ubyte.gz')\n",
    "## Flatten the test array\n",
    "test_data = np.reshape(raw_test, (10000, 28 * 28))\n",
    "test_label = read_idx('data/t10k-labels-idx1-ubyte.gz')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Format new data into 30x30 for sliding window technique\n",
    "\n",
    "raw_test_data_30x30 = []\n",
    "\n",
    "for i in range(len(raw_test)):\n",
    "    raw_test_data_30x30.append(np.pad(raw_test[i], ((1,1), (1,1)), 'constant', constant_values=((0,0), (0,0))))\n",
    "    \n",
    "raw_test_data_30x30 = np.array(raw_test_data_30x30)\n",
    "test_data_30x30 = np.reshape(raw_test_data_30x30, (10000, 30 * 30)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Euclidean Distance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def euclideanDistance(X, Y):\n",
    "    return np.linalg.norm((X - Y));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Find the Nearest Neighbors\n",
    "This function finds the nearest neighbors by creating an array the same size as the training data and populates that array with the image that we are trying to find the distances for compared to the other training data. Then the array is sorted based on the first k elements and the indexs are recorded and returned.\n",
    "\n",
    "In the sliding windows approach we loop over the 30x30 image with boxes of 28x28 and calculate the euclidean distance for each box, then store the results in an array. After the image has been cropped 9 different times we take the smallest distance and use that as the distance for that particular piece of test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def nearestNeighbors(train_data, i, k, slidingWindow):\n",
    "    distances = []\n",
    "    n = train_data.shape[0]\n",
    "    \n",
    "    if slidingWindow:\n",
    "        for j in range(len(train_data)):\n",
    "            cropDistances = []\n",
    "            for k in range(9):\n",
    "                crop = crops[i][k]\n",
    "                cropDistances.append(euclideanDistance(crop, train_data[j]))\n",
    "            distances.append(np.sort(cropDistances)[0])\n",
    "    else:\n",
    "        for j in range(len(train_data)):\n",
    "            distances.append(distance_array[i][j])\n",
    "        \n",
    "    \n",
    "    neighbors_idxs = np.argsort(distances)[:k]\n",
    "    \n",
    "    return neighbors_idxs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def nearestNeighborsOptimal(i, k, k_fold, subset_size):\n",
    "    distances = []\n",
    "    \n",
    "    for j in range(k_fold * subset_size):\n",
    "        distances.append(distance_array[i][j])\n",
    "        \n",
    "    for j in range(((k_fold + 1) * subset_size), len(train_data)): \n",
    "        distances.append(distance_array[i][j])\n",
    "                   \n",
    "    neighbors_idxs = np.argsort(distances)[:k] \n",
    "    \n",
    "    return neighbors_idxs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### K Nearest Neighbor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The K Nearest Neighbor algorithim works by looping through each value in the test_data, determines the nearest neighbor, then predicts the result based on the point that it's closest to.\n",
    "\n",
    "The function returns an array of predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def kNearestNeighbor(train_data, train_labels, test_data, k, slidingWindow=False):\n",
    "    predictions = []\n",
    "    \n",
    "    for i in tqdm(range( len( test_data ) )):\n",
    "        neighbors = nearestNeighbors(train_data, i, k, slidingWindow)\n",
    "        predictions.append(predict(neighbors, train_labels))\n",
    "\n",
    "    return np.array(predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def kNearestNeighborOptimal(k, k_fold, subset_size, training_labels):\n",
    "    predictions = []\n",
    "#     print(\"Test data: (\",(k_fold * subset_size),\",\", ((k_fold  + 1) * subset_size), \")\")\n",
    "    for i in tqdm( range((k_fold * subset_size), ((k_fold  + 1) * subset_size)) ):\n",
    "        neighbors = nearestNeighborsOptimal(i, k, k_fold, subset_size)\n",
    "        predictions.append(predict(neighbors, training_labels))\n",
    "        \n",
    "    return np.array(predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Cross Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function runs the cross validation on the specified number of folds. We are using 10-fold cross validation so that is the only value that will be passed later on in our code.\n",
    "\n",
    "This works by looping through the array 10 times and uses each subset that is 1/10 the size as testing data, at the same time using the rest as training data. The accuracies are then recorded and averaged to get the mean accuracy for that particular test. We run this function for every value of k to find the optimal amount of nearest neighbors.\n",
    "\n",
    "The usefulness of running cross validation is the fact that it gives us an idea of the performance on the entire dataset rather than a specific subset in every case. This essentially gives us a more diverse group of testing data and a more accurate accuracy with the same data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def crossValidation(num_folds, k):\n",
    "    subset_size = len(train_data) // num_folds\n",
    "    accuracyList = []\n",
    "    \n",
    "    for i in range(num_folds):\n",
    "        testing_labels = train_label[i*subset_size:(i + 1)*subset_size]\n",
    "        training_labels = np.concatenate((train_label[:i*subset_size], train_label[(i + 1)*subset_size:]), axis=0)\n",
    "        predictions = kNearestNeighborOptimal(k, i, subset_size, training_labels)\n",
    "        accuracyList.append(accuracy(predictions, testing_labels))\n",
    "        \n",
    "    meanAccuracy = sum(accuracyList) / float(len(accuracyList))\n",
    "    print(\"The accuracy when k=\" + str(k) + \" is: \"+ str(meanAccuracy), flush=True)  \n",
    "    return meanAccuracy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Predict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This function takes in an array of neighbors and the training labels then adds the label of the neighbor to the results array for each nearest neighbor.\n",
    "\n",
    "The function then returns the prediction based off what label appears with the greatest frequency in the results array. This is done with the numpy np.bincount function that finds the greatest frequencies in an array format, then uses the np.argmax function to find the index of the max labels that appears and returns the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict(neighbors, train_labels):\n",
    "    results = []\n",
    "    for idx in neighbors:\n",
    "        results.append(train_labels[idx])\n",
    "    results = np.array(results)\n",
    "    return np.argmax(np.bincount(results));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def accuracy(predictions, test_labels):\n",
    "    return np.sum(predictions == test_labels) / len(test_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Find the Optimal K"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finds the optimal value of k by looping through k values 1-10 and determines the optimal k by saving the k with the highest classfication accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def findOptimalK():\n",
    "    optimal_k = 1\n",
    "    optimal_accuracy = -1\n",
    "    k_min = 1\n",
    "    k_max = 10\n",
    "    k_accuracy = -1;\n",
    "\n",
    "    for i in range(6, k_max + 1):\n",
    "        k_accuracy = crossValidation(10, i)\n",
    "        \n",
    "        if k_accuracy > optimal_accuracy:\n",
    "            optimal_k = i\n",
    "            optimal_accuracy = k_accuracy\n",
    "\n",
    "    #optimal_k = k_accuracy.index(max(k_accuracy)) + 1\n",
    "    print(\"The optimal value of k is: \",optimal_k,\" with an accuracy of \",(100 * optimal_accuracy), \"%\")\n",
    "    return optimal_k"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Calculate Confidence Interval"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculateConfidenceInterval(accuracy, length, z):\n",
    "    \n",
    "    interval = z * sqrt( (accuracy * (1 - accuracy)) / length )\n",
    "    \n",
    "    return interval"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Create Confusion Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def createConfusionMatrix(predictions, test_labels):\n",
    "    \n",
    "    y_actu = pd.Series(test_labels, name='Actual')\n",
    "    y_pred = pd.Series(predictions, name='Predicted')\n",
    "    df_confusion = pd.crosstab(y_actu, y_pred)\n",
    "    \n",
    "    df_cm = pd.DataFrame(df_confusion)\n",
    "    sn.set(font_scale=1.4)\n",
    "    plt.figure(figsize = (12,8))\n",
    "    sn.heatmap(df_cm, annot=True,annot_kws={\"size\": 16}, fmt='g')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Image Cropper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def imageCropper(img, xstart, ystart):\n",
    "    return img[ystart: ystart + 28, xstart: xstart + 28]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# K Nearest Neighbor k=1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "distance_array = cdist(test_data, train_data, 'euclidean')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "k1_predictions = kNearestNeighbor(train_data, train_label, test_data, 1)\n",
    "\n",
    "k1_accuracy = accuracy(k1_predictions, test_label)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Classifcation accuracy:\", k1_accuracy * 100, \"%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Find the Optimal K"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Find the optimal value of k\n",
    "distance_array = cdist(train_data, train_data, 'euclidean')\n",
    "optimal_k = findOptimalK()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Standard K Nearest Neighbor Statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "del distance_array\n",
    "## Uses a distance array and calculates the euclidean distance for it, speeds up execution.\n",
    "distance_array = cdist(test_data, train_data, 'euclidean')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "standard_predictions = kNearestNeighbor(train_data, train_label, test_data, optimal_k)\n",
    "\n",
    "standard_accuracy = accuracy(standard_predictions, test_label)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "standard_interval = calculateConfidenceInterval(standard_accuracy, len(test_label), 1.96)\n",
    "\n",
    "print(\"Classifcation accuracy:\", standard_accuracy * 100, \"%\")\n",
    "print(\"Confidence interval for 95%: (\", (standard_accuracy - standard_interval) * 100 ,\"%,\",(standard_interval + standard_accuracy) * 100, \"%)\")\n",
    "createConfusionMatrix(standard_predictions, test_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sliding Windows K Nearest Neighbor Statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "crops = np.zeros((10000, 9, 784))\n",
    "for k in tqdm(range(len(raw_test_data_30x30))):\n",
    "    index = 0\n",
    "    for i in range(3):\n",
    "        for j in range(3):\n",
    "            crop = imageCropper(raw_test_data_30x30[k], i, j)\n",
    "            crop = np.reshape(crop, (28 * 28))\n",
    "            crops[k][index] = crop\n",
    "            index = index + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sliding_predictions = kNearestNeighbor(train_data, train_label,  raw_test_data_30x30, 3, slidingWindow=True)\n",
    "\n",
    "sliding_accuracy = accuracy(sliding_predictions, test_label)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sliding_interval = calculateConfidenceInterval(sliding_accuracy, len(test_label), 1.96)\n",
    "\n",
    "print(\"Classifcation accuracy:\", sliding_accuracy * 100, \"%\")\n",
    "print(\"Confidence interval for 95%: (\", (sliding_accuracy - sliding_interval) * 100 ,\"%,\",(sliding_interval + sliding_accuracy) * 100, \"%)\")\n",
    "createConfusionMatrix(sliding_predictions, test_label)\n",
    "                \n",
    "                "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Significance Testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
